Method and System for Devising an Optimum Control Policy

ABSTRACT

A method for devising an optimum control policy of a controller for controlling a system includes optimizing at least one parameter that characterizes the control policy. A Gaussian process model is used to model expected dynamics of the system. The optimization optimizes a cost function which depends on the control policy and the Gaussian process model with respect to the at least one parameter. The optimization is carried out by evaluating at least one gradient of the cost function with respect to the at least one parameter. For an evaluation of the cost function a temporal evolution of a state of the system is computed using the control policy and the Gaussian process model. The cost function depends on an evaluation of an expectation value of a cost function under a probability density of an augmented state at time steps.

This application claims priority under 35 U.S.C. § 119 to patent application no. DE 10 2018 202 431.6, filed on Feb. 16, 2018 in Germany, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

PID control architectures are widely used in industrial applications. Despite their low number of open parameters, tuning multiple, coupled PID controllers can become tedious in practice.

The publication “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Marc Peter Deisenroth, Carl Edward Rasmussen, 2011, which can be accessed at http://www.icml-2011.org/papers/323_icmlpaper.pdf, discloses a model-based policy search method.

SUMMARY

The method with the features disclosed herein has the advantage that it renders PID tuning possible as the solution of a finite-horizon optimal control problem, without requiring further a priori knowledge.

Proportional, Integral and Derivative (PID) control structures are still a widely used control tool in industrial applications, in particular in the process industry, but also in automotive applications and in low-level control in robotics. The large share of PID-controlled applications is mainly due to the past record of success, the wide availability, and the simplicity in use of this technique. Even in multivariable systems, PID controllers can be employed.

Exploring the mathematics behind the disclosure, it is possible to consider discrete-time dynamic systems of the form

$\begin{matrix}{x_{t + 1} = {f\left( {x_{t},u_{t}} \right) + \epsilon_{t}}} & (1)\end{matrix}$

with continuously valued state $x_{t} \in \mathbb{R}^{D}$ as well as continuously valued input $u_{t} \in \mathbb{R}^{F}$. The system dynamics $f$ is not known a priori. One may assume a fully measurable state, which is corrupted by zero-mean independent and identically distributed (i.i.d.) Gaussian noise, i.e. $\epsilon_{t} \sim \mathcal{N}\left( {0,\Sigma_{\epsilon}} \right)$.

One specific reinforcement learning formulation aims at minimizing the expected cost-to-go given by

$\begin{matrix}{{J = {\sum\limits_{t = 0}^{T}{\mathbb{E}\left\lbrack {c\left( {x_{t},u_{t};t} \right)} \right\rbrack}}},\;{x_{0} \sim \mathcal{N}\left( {\mu_{0},\Sigma_{0}} \right)}} & (2)\end{matrix}$

where an immediate, possibly time-dependent cost $c(x_{t},u_{t};t)$ penalizes undesired system behavior. Policy search methods optimize the expected cost-to-go $J$ by selecting the best out of a range of policies $u_{t} = \pi(x_{t};\theta)$ parametrized by $\theta$. A model $\hat{f}$ of the system dynamics $f$ is utilized to predict the system behavior and to optimize the policy.

In a first aspect, the disclosure therefore relates to a method for devising an optimum control policy π of a controller, especially a PID controller, for controlling a (physical) system, said method comprising optimizing at least one parameter θ that characterizes said control policy π, wherein a Gaussian process model $\hat{f}$ is used to model expected dynamics of the system, if the system is acted upon by said PID controller, wherein said optimization optimizes a cost function J which depends on said control policy π and said Gaussian process model $\hat{f}$ with respect to said at least one parameter θ, wherein said optimization is carried out by evaluating at least one gradient of said cost function J with respect to said at least one parameter θ, wherein for an evaluation of said cost function J a temporal evolution of a state $x_{t}$ of the system is computed using said control policy π and said Gaussian process model, wherein said cost function J depends on an evaluation of an expectation value of a cost function c under a probability density of an augmented state $z_{t}$ at predefinable time steps t.

The control output of a scalar PID controller is given by

$\begin{matrix}{u_{t} = {{K_{p}e_{t}} + {K_{i}{\int_{0}^{t}{e_{\tau}d\tau}}} + {K_{d}{\overset{.}{e}}_{t}}}} & (3)\end{matrix}$

$\begin{matrix}{e_{t} = {x_{{des},t} - x_{t}}} & (4)\end{matrix}$

The current desired state $x_{{des},t}$ can be either a constant set-point or a time-variable goal trajectory. A PID controller is agnostic to the system dynamics and depends only on the system's error. Each controller is parametrized by its proportional, integral and derivative gain $\theta_{PID} = (K_{p},K_{i},K_{d})$. Of course, some of these gains may be fixed to zero, yielding e.g. a PD controller in the case of $K_{i} = 0$.
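
By way of a purely illustrative example, the scalar law of equations (3) and (4) may be realized in discrete time as in the following Python sketch; the class and all names are expository only and do not limit the disclosure. The integral and derivative terms use the discrete approximations introduced in equations (8) and (9) below.

```python
# Minimal discrete-time sketch of the scalar PID law of equations (3)-(4).
# The integral is a running sum scaled by the sampling time dT, and the
# derivative is a backward difference, matching equations (8) and (9).

class ScalarPID:
    def __init__(self, K_p, K_i, K_d, dT):
        self.K_p, self.K_i, self.K_d = K_p, K_i, K_d
        self.dT = dT
        self.e_prev = 0.0   # error at the previous time step
        self.e_sum = 0.0    # accumulated error, approximates the integral

    def step(self, x_des, x):
        e = x_des - x                          # equation (4)
        self.e_sum += self.dT * e              # ∫ e dτ ≈ ΔT Σ e_τ
        e_dot = (e - self.e_prev) / self.dT    # ė ≈ (e_t − e_{t−1}) / ΔT
        self.e_prev = e
        return self.K_p * e + self.K_i * self.e_sum + self.K_d * e_dot  # eq. (3)
```

Practical details such as actuator saturation or integrator anti-windup are deliberately omitted from this sketch.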

A general PID control structure C(s) for MIMO (multi-input multi-output) processes can be described in transfer function notation by an F×D transfer function matrix

$\begin{matrix}{{C(s)} = \begin{bmatrix}{c_{11}(s)} & \ldots & {c_{1D}(s)} \\\vdots & \ddots & \vdots \\{c_{F\; 1}(s)} & \ldots & {c_{FD}(s)}\end{bmatrix}} & (5)\end{matrix}$

where s denotes the complex Laplace variable and the $c_{ij}(s)$ are of PID type. The multivariate error is given by $e_{t} = {x_{{des},t} - x_{t}} \in \mathbb{R}^{D}$ such that the multivariate input becomes $u(s) = C(s)e(s)$.

FIG. 1 shows a humanoid robot 1 balancing an inverted pendulum 2. Using the disclosure, coupled PID and PD controllers were successfully trained to stabilize the pole in the central, upright position without requiring a priori system knowledge.

We present a sequence of state augmentations such that any multivariable PID controller as given by equation (5) can be represented as a parametrized static state feedback law. A visualization of the state augmentation integrated into the one-step-ahead prediction is shown in FIG. 2, in comparison with the standard PILCO setting. All lines linking the blocks $\tilde{z}_{t}$, $z_{t + 1}$ are absent in the standard PILCO setting.

Given a Gaussian distributed initial state $x_{0}$, the resulting predicted states will remain Gaussian for the presented augmentations.

To obtain the required error states for each controller given by equation (3), it is possible to define an augmented system state $z_{t}$ that may also keep track of the error at the previous time step and the accumulated error,

$\begin{matrix}{z_{t}:=\left( {x_{t},e_{t - 1},{\Delta T{\sum\limits_{\tau = 0}^{t - 1}e_{\tau}}}} \right)} & (6)\end{matrix}$

where ΔT is the system's sampling time.

For simplicity, vectors are denoted as tuples $(v_{1},\ldots,v_{n})$ where the $v_{i}$ may be vectors themselves. The following augmentations can be made to obtain the necessary policy inputs:

The augmented state $z_{t}$ and/or the desired state $x_{{des},t}$ (set-point or target trajectory) may be selected as independent Gaussian random variables, i.e.

$\begin{matrix}{\begin{bmatrix}z_{t} \\ x_{{des},t}\end{bmatrix} \sim \mathcal{N}\left( {\begin{bmatrix}\mu_{z} \\ \mu_{{des},t}\end{bmatrix},\begin{bmatrix}\Sigma_{z} & 0 \\ 0 & \Sigma_{{des},t}\end{bmatrix}} \right)} & (7)\end{matrix}$

Drawing the desired state $x_{{des},t}$ from a Gaussian distribution yields improved generalization to unseen targets.

The current error is a linear function of $z_{t}$ and $x_{{des},t}$. The current error derivative and integrated error may conveniently be approximated by

$\begin{matrix}{{\overset{.}{e}}_{t} \approx \frac{e_{t} - e_{t - 1}}{\Delta T}} & (8) \\ {{\int_{\tau = 0}^{t}{e_{\tau}d\tau}} \approx {{\Delta T{\sum\limits_{\tau = 0}^{t - 1}e_{\tau}}} + {\Delta Te_{t}}}} & (9)\end{matrix}$

Both approximations are linear transformations of the augmented state. The resulting augmented state distribution remains Gaussian as it is a linear transformation of a Gaussian random variable.

This aspect of the disclosure can readily be extended to incorporate a low-pass filtered error derivative. In this case, additional historic error states would be added to the state $z_{t}$ to provide the input for a low-pass Finite Impulse Response (FIR) filter. This reduces measurement noise in the derivative error.

A fully augmented state $\tilde{z}_{t}$ is then conveniently given by

$\begin{matrix}{{\overset{\sim}{z}}_{t}:=\left( {z_{t},x_{{des},t},e_{t},\frac{e_{t} - e_{t - 1}}{\Delta T},{\Delta T{\sum\limits_{\tau = 0}^{t}e_{\tau}}}} \right)} & (10)\end{matrix}$
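
The following point-value sketch (illustrative only; the disclosed method propagates full Gaussian distributions through these linear maps instead) shows how $\tilde{z}_{t}$ of equation (10) may be assembled; the function and variable names are assumptions made for exposition.

```python
import numpy as np

# Assemble the fully augmented state z~_t of equation (10) from the
# components of z_t = (x, e_prev, e_acc), equation (6), and x_des.

def fully_augment(x, e_prev, e_acc, x_des, dT):
    e = x_des - x                # current error e_t
    e_dot = (e - e_prev) / dT    # backward difference, equation (8)
    e_int = e_acc + dT * e       # accumulated error, equation (9)
    z_tilde = np.concatenate([x, e_prev, e_acc, x_des, e, e_dot, e_int])
    return z_tilde, e, e_int     # e and e_int are carried into z_{t+1}, eq. (14)
```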

Based on the fully augmented state $\tilde{z}_{t}$, the PID control policy for multivariate controllers can be expressed as a static state feedback policy

$\begin{matrix}\begin{matrix}{u_{t} = {A_{PID}\left( {{\overset{\sim}{z}}_{t}^{(3)},{\overset{\sim}{z}}_{t}^{(5)},{\overset{\sim}{z}}_{t}^{(4)}} \right)}} \\ {= {{A_{PID}\left( {e_{t},{\Delta T{\sum\limits_{\tau = 0}^{t}e_{\tau}}},\frac{e_{t} - e_{t - 1}}{\Delta T}} \right)}.}}\end{matrix} & (11)\end{matrix}$

The specific structure of the multivariate PID control law is defined by the parameters in $A_{PID}$. For example, PID structures as shown in FIG. 3 may be represented by

$\begin{matrix}{{A_{a)} = \begin{bmatrix}K_{p,1} & 0 & K_{i,1} & 0 & K_{d,1} & 0 \\0 & K_{p,2} & 0 & K_{i,2} & 0 & K_{d,2}\end{bmatrix}},} & (12) \\{A_{b)} = {\begin{bmatrix}K_{p,1} & K_{p,2} & K_{i,1} & K_{i,2} & K_{d,1} & K_{d,2}\end{bmatrix}.}} & (13)\end{matrix}$
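
As a sketch with arbitrary placeholder gain values (the concrete gains are the parameters to be optimized), the feedback law of equation (11) and the two gain structures of equations (12) and (13) might be written as:

```python
import numpy as np

# Multivariate PID law of equation (11) as a single matrix-vector product.

def pid_feedback(A_PID, e, e_int, e_dot):
    """u_t = A_PID (e_t, ΔT Σ e_τ, (e_t − e_{t−1})/ΔT)."""
    return A_PID @ np.concatenate([e, e_int, e_dot])

# Equation (12): two decoupled PID loops (block pattern of zeros).
A_a = np.array([[1.0, 0.0, 0.5, 0.0, 0.1, 0.0],
                [0.0, 1.0, 0.0, 0.5, 0.0, 0.1]])

# Equation (13): a single input driven by both error channels.
A_b = np.array([[1.0, 1.0, 0.5, 0.5, 0.1, 0.1]])
```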

Given the Gaussian distributed augmented state and control input as derived above, the next augmented state may be computed using the GP dynamics model $\hat{f}$. It is possible to approximate the predictive distribution $p(x_{t + 1})$ by a Gaussian distribution using exact moment matching. From the dynamics model output $x_{t + 1}$ and the current error stored in the fully augmented state $\tilde{z}_{t}$, the next state may be obtained as

$\begin{matrix}{{z_{t + 1} = \left( {x_{t + 1},{\overset{\sim}{z}}_{t}^{(3)},{\overset{\sim}{z}}_{t}^{(5)}} \right)} = {\left( {x_{t + 1},e_{t},{\Delta T{\sum\limits_{\tau = 0}^{t}e_{\tau}}}} \right).}} & (14)\end{matrix}$

Iterating (6) to (14), a long-term prediction can be computed over a prediction horizon H as illustrated in FIG. 2. For the initial augmented state, one may conveniently define

$\begin{matrix}{z_{0}:=\left( {x_{0},{x_{{des},0} - x_{0}},0} \right).} & (15)\end{matrix}$
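
A condensed sketch of this rollout is given below; for readability it propagates mean values only, whereas the disclosed method propagates full Gaussian distributions through the GP model by exact moment matching. The helper `gp_predict` is a placeholder for the one-step-ahead prediction of the dynamics model $\hat{f}$, and the quadratic cost is merely an example.

```python
import numpy as np

# Schematic long-term prediction over a horizon H, iterating the
# augmentation and prediction steps of equations (6)-(15).

def rollout(x0, x_des, A_PID, gp_predict, H, dT):
    x = x0
    e_prev = x_des - x0              # initial augmented state, equation (15)
    e_acc = np.zeros_like(x0)
    cost = 0.0
    for t in range(H):
        e = x_des - x
        e_dot = (e - e_prev) / dT                         # equation (8)
        e_int = e_acc + dT * e                            # equation (9)
        u = A_PID @ np.concatenate([e, e_int, e_dot])     # equation (11)
        x = gp_predict(x, u)                              # one-step prediction
        e_prev, e_acc = e, e_int                          # carry over, eq. (14)
        cost += np.sum((x_des - x) ** 2)                  # illustrative cost
    return cost
```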

Given the presented augmentation and propagation steps, the expected cost gradient can be computed analytically such that the policy π can be efficiently optimized using gradient-based methods.

The expected cost derivative may be obtained as

$\begin{matrix}{\frac{dJ}{d\; \theta} = {{\sum\limits_{t = 1}^{H}{\frac{d}{d\; \theta}{_{z_{t}}\left\lbrack {c\left( z_{t} \right)} \right\rbrack}}} = {\sum\limits_{t = 1}^{H}{\frac{d\; ɛ_{t}}{{dp}\left( z_{t} \right)}{\frac{{dp}\left( z_{t} \right)}{d\; \theta}.}}}}} & (16)\end{matrix}$

Here, we denoted $\varepsilon_{t} = \mathbb{E}_{z_{t}}\left\lbrack {c\left( z_{t} \right)} \right\rbrack$ and we write $dp(z_{t})$ to denote the sufficient statistics derivatives $d\mu_{t}$ and $d\Sigma_{t}$ of a Gaussian random variable $p(z_{t}) \sim \mathcal{N}\left( {\mu_{t},\Sigma_{t}} \right)$. The gradient of the immediate loss with respect to the augmented state distribution, $d\varepsilon_{t}/{dp}\left( z_{t} \right)$, is readily available for cost functions like quadratic or saturated exponential terms and Gaussian input distributions.
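
By way of illustration (an example added here, not recited above), for a quadratic cost $c(z) = {\left( {z - z^{*}} \right)^{\top}Q\left( {z - z^{*}} \right)}$ with target $z^{*}$ and symmetric weight matrix Q, the expected immediate loss under $p(z_{t}) \sim \mathcal{N}\left( {\mu_{t},\Sigma_{t}} \right)$ and its sufficient-statistics gradients are available in closed form:

$\varepsilon_{t} = {{\left( {\mu_{t} - z^{*}} \right)^{\top}Q\left( {\mu_{t} - z^{*}} \right)} + {tr\left( {Q\Sigma_{t}} \right)}},\qquad \frac{\partial\varepsilon_{t}}{\partial\mu_{t}} = {2Q\left( {\mu_{t} - z^{*}} \right)},\qquad \frac{\partial\varepsilon_{t}}{\partial\Sigma_{t}} = Q.$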

The gradient for each predicted augmented state in the long-term rollout may be obtained by applying the chain rule to (14), resulting in

$\begin{matrix}{\frac{{dp}\left( z_{t + 1} \right)}{d\; \theta} = {{\frac{\partial{p\left( z_{t + 1} \right)}}{\partial{p\left( {\overset{\sim}{z}}_{t} \right)}}\frac{{dp}\left( {\overset{\sim}{z}}_{t} \right)}{d\; \theta}} + {\frac{\partial{p\left( z_{t + 1} \right)}}{\partial{p\left( x_{t + 1} \right)}}{\frac{{dp}\left( x_{t + 1} \right)}{d\; \theta}.}}}} & (17)\end{matrix}$

The derivatives $\frac{\partial{p\left( z_{t + 1} \right)}}{\partial{p\left( {\overset{\sim}{z}}_{t} \right)}}$ and $\frac{\partial{p\left( z_{t + 1} \right)}}{\partial{p\left( x_{t + 1} \right)}}$ may be computed for the linear transformation in equation (14) according to the general rules for linear transformations on Gaussian random variables.

The gradient of the dynamics model output $x_{t + 1}$ is given by

$\begin{matrix}{\frac{{dp}\left( x_{t + 1} \right)}{d\theta} = {{\frac{\partial{p\left( x_{t + 1} \right)}}{\partial{p\left( {\overset{\sim}{z}}_{t} \right)}}\frac{{dp}\left( {\overset{\sim}{z}}_{t} \right)}{d\theta}} + {\frac{\partial{p\left( x_{t + 1} \right)}}{\partial{p\left( u_{t} \right)}}{\frac{{dp}\left( u_{t} \right)}{d\theta}.}}}} & (18)\end{matrix}$

Applying the chain rule for the policy output $p\left( u_{t} \right)$ yields

$\begin{matrix}{\frac{{dp}\left( u_{t} \right)}{d\; \theta} = {{\frac{\partial{p\left( u_{t} \right)}}{\partial{p\left( {\overset{\sim}{z}}_{t} \right)}}\frac{{dp}\left( {\overset{\sim}{z}}_{t} \right)}{d\; \theta}} + {\frac{\partial{p\left( u_{t} \right)}}{\partial\theta}.}}} & (19)\end{matrix}$

The derivatives $\frac{\partial{p\left( u_{t} \right)}}{\partial{p\left( {\overset{\sim}{z}}_{t} \right)}}$ and $\frac{\partial{p\left( u_{t} \right)}}{\partial\theta}$ are introduced by the linear control law given by equation (11) and can be computed according to the general rules for linear transformations on Gaussian random variables. The gradient of the fully augmented state $\tilde{z}_{t}$ is given by

$\begin{matrix}{\frac{{dp}\left( {\overset{\sim}{z}}_{t} \right)}{d\; \theta} = {\frac{{dp}\left( {\overset{\sim}{z}}_{t} \right)}{d\; {p\left( z_{t} \right)}}{\frac{{dp}\left( z_{t} \right)}{d\; \theta}.}}} & (20)\end{matrix}$

The derivative ${dp}\left( {\overset{\sim}{z}}_{t} \right)/{dp}\left( z_{t} \right)$ may be computed for the linear transformation given by equation (10). Starting from an initial augmented state $z_{0}$ where

${\frac{{dp}\left( z_{0} \right)}{d\; \theta} = 0},$

it is possible to obtain gradients for all augmented states $z_{t}$ with respect to the policy parameters θ, $dp(z_{t})/d\theta$, by iteratively applying equations (17) to (20) for all time steps t.
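
The iteration can be summarized in the following schematic sketch; each `J_*` callable is a placeholder (an assumption made for exposition) for the Jacobian of one mapping, stacked over the sufficient statistics $\mu_{t}$, $\Sigma_{t}$, whose closed forms follow from the rules for linear transformations of Gaussian random variables and from the moment-matched GP prediction described above.

```python
import numpy as np

# Chain-rule accumulation of dp(z_t)/dθ following equations (17)-(20).

def accumulate_gradients(J_aug, J_pol_z, J_pol_theta, J_dyn_z, J_dyn_u,
                         J_next_z, J_next_x, H, n_z, n_theta):
    dz = np.zeros((n_z, n_theta))                   # dp(z_0)/dθ = 0
    grads = []
    for t in range(H):
        dzt = J_aug(t) @ dz                         # equation (20)
        du = J_pol_z(t) @ dzt + J_pol_theta(t)      # equation (19)
        dx = J_dyn_z(t) @ dzt + J_dyn_u(t) @ du     # equation (18)
        dz = J_next_z(t) @ dzt + J_next_x(t) @ dx   # equation (17)
        grads.append(dz)                            # dp(z_{t+1})/dθ
    return grads
```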

The disclosure is also directed to a computer program. The computer program product comprises computer-readable instructions stored on a non-transitory machine-readable medium that are executable by a computer having a processor for causing the processor to perform the operations listed herein.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the disclosure will be apparent from the following detailed descriptions of the various aspects of the disclosure in conjunction with reference to the following drawings, where:

FIG. 1 is an illustration of a humanoid robot trained with a system according to the disclosure;

FIG. 2 is a schematic illustration of the mathematics behind the method according to an aspect of the disclosure;

FIG. 3 is a block diagram depicting control structures of a controller to which the disclosure may be applied;

FIG. 4 is a block diagram depicting components of a system according to an aspect of the disclosure;

FIG. 5 is a block diagram depicting components of a system according to another aspect of the disclosure;

FIG. 6 is a flowchart diagram depicting the method according to one aspect of the disclosure.

DETAILED DESCRIPTION

FIG. 4 shows a block diagram depicting components of a system according to an aspect of the disclosure. Shown is a control system 40, which receives sensor signals S from a sensor 30 via an input unit 50. The sensor senses a state of a physical system 10, e.g. a robot (like e.g. the humanoid robot 1 shown in FIG. 1, or an at least partially self-driving car), or more generally an actuator (like e.g. a throttle valve), in an environment 20. The input unit 50 transforms these sensor signals S into a signal representing said state x. For example, the input unit 50 may copy the sensor signal S into a predefined signal format. If the sensor signal S is in a suitable format, the input unit 50 may be omitted altogether.

This signal representing state x is then passed on to a controller 60, which may, for example, be given by a PID controller. The controller is parameterized by parameters θ, which the controller 60 may receive from a parameter storage P. The controller 60 computes a signal representing an input signal u, e.g. via equation (11). This signal is then passed on to an output unit 80, which transforms the signal representing the input signal u into an actuation signal A, which is passed on to the physical system 10, and causes said physical system 10 to act. Again, if the input signal u is in a suitable format, the output unit may be omitted altogether.

The controller 60 may be controlled by software which may be stored on a machine-readable storage medium 45 and executed by a processor 46. For example, said software may be configured to compute the input signal u using the control law given by equation (11).

FIG. 5 shows a block diagram depicting a training system 140, which may be configured to train the control system 40. The training system 140 may comprise an input unit 150 for receiving signals representing an input signal u and a state signal x, which are then passed on to a block 190 which receives present parameters θ from parameter storage P and computes new parameters θ′. These new parameters θ′ are then passed on to parameter storage P to replace present parameters θ. The block 190 may be operated by software which may be stored on a machine-readable storage medium 210 and executed by a processor 200. For example, block 190 may be configured to execute the steps of the method shown in FIG. 6.

FIG. 6 is a flowchart diagram depicting a method for devising optimum parameters θ for an optimum control policy π of controller 60.

First (1000), a random policy is devised, e.g. by randomly assigning values for parameters θ and storing them in parameter storage P. The controller 60 then controls physical system 10 by executing its control policy π corresponding to these random parameters θ. The corresponding state signals x and input signals u are recorded and passed on to block 190.

Next (1010), a GP dynamics model $\hat{f}$ is trained using the recorded signals x and u to model the temporal evolution of the system state x, $x_{t + 1} = \hat{f}\left( {x_{t},u_{t}} \right)$.
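
One possible realization of this training step is sketched below using scikit-learn; the disclosure does not prescribe a particular library, and training on state differences $x_{t + 1} - x_{t}$ is a common choice for GP dynamics models rather than a requirement.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Fit a GP dynamics model on recorded rollout data and return a one-step
# predictor x_{t+1} = f^(x_t, u_t).

def train_gp_model(X, U):
    """X: (N+1, D) recorded states, U: (N, F) recorded inputs."""
    inputs = np.hstack([X[:-1], U])     # training inputs (x_t, u_t)
    targets = X[1:] - X[:-1]            # training targets Δx_t
    kernel = RBF(length_scale=np.ones(inputs.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(inputs, targets)

    def predict(x, u):
        delta = gp.predict(np.hstack([x, u]).reshape(1, -1))[0]
        return x + delta                # predicted next state
    return predict
```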

Then (1020), a roll-out of the augmented system state $z_{t}$ over a horizon H is computed based on the GP dynamics model $\hat{f}$, the present parameters θ and the corresponding control policy π(θ), and the gradient of the cost function J w.r.t. the parameters θ is computed, e.g. by equations (17)-(20).

Based on these gradients, new parameters θ′ are computed (1030). These new parameters θ′ replace present parameters θ in parameter storage P.

Next, it is checked whether the parameters θ have converged sufficiently (1040). If it is decided that they have not, the method iterates back to step 1020. Otherwise, the present parameters θ are selected as optimum parameters θ* that minimize the cost function J (1050).

Controller 60 is then executed with a control policy π corresponding to these optimum parameters θ* to control the physical system 10. The input signal u and the state signal x are recorded (1060).

The GP dynamics model $\hat{f}$ is then updated (1070) using the recorded signals x and u.

Next, it is checked whether the GP dynamics model $\hat{f}$ has sufficiently converged (1080). This convergence can be checked e.g. by checking the convergence of the log likelihood of the measured data x, u, which is maximized by adjusting the hyperparameters of the GP, e.g. with a gradient-based method. If it is deemed not to have sufficiently converged, the method branches back to step 1020. Otherwise, the present optimum parameters θ* are selected as parameters θ that will be used to parametrize the control policy π of controller 60. This concludes the method.
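
The overall procedure of FIG. 6 can be condensed into the following schematic loop; all helper functions are placeholders for the steps described above, and the sketch is illustrative rather than limiting.

```python
# Alternation of model learning and policy optimization as in FIG. 6.

def tune_controller(system, theta, max_iterations=20):
    X, U = execute_policy(system, theta)           # step 1000: initial rollout
    for _ in range(max_iterations):
        gp = train_gp_model(X, U)                  # steps 1010 / 1070
        theta = optimize_policy(gp, theta)         # steps 1020-1050
        X, U = execute_policy(system, theta)       # step 1060: real system
        if converged(gp, X, U):                    # step 1080
            break
    return theta                                   # optimum parameters θ*
```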

Parts of this disclosure have been published as “Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers”, arXiv:1703.02899v1, 2017, Andreas Doerr, Duy Nguyen-Tuong, Alonso Marco, Stefan Schaal, Sebastian Trimpe, which is incorporated herein by reference in its entirety.

What is claimed is:
 1. A method for devising an optimum control policy of a controller for controlling a system, said method comprising: optimizing at least one parameter that characterizes said control policy; using a Gaussian process model to model expected dynamics of the system, wherein said optimization optimizes a cost function which depends on said control policy and said Gaussian process model with respect to said at least one parameter; and carrying out said optimization by evaluating at least one gradient of said cost function with respect to said at least one parameter, wherein for an evaluation of said cost function a temporal evolution of a state of the system is computed using said control policy and said Gaussian process model, and wherein said cost function depends on an evaluation of an expectation value of a cost function under a probability density of an augmented state at time steps.
 2. The method according to claim 1, wherein said augmented state at a given time step comprises the state at said given time step.
 3. The method according to claim 1, wherein said augmented state at a given time step comprises an error between the state and a desired state at a previous time step.
 4. The method according to claim 1, wherein said augmented state at a given time step comprises an accumulated error of a previous time step.
 5. The method according to claim 3, wherein the augmented state and/or the desired state are Gaussian random variables.
 6. The method according to claim 1, wherein the controller is a multivariate controller.
 7. The method according to claim 1, wherein: a first step of optimizing said at least one parameter by said optimization of said cost function with respect to said at least one parameter, a second step of controlling said system by said controller using said control policy parametrized by said optimized at least one parameter, and a third step of updating said Gaussian process model based on a recorded reaction of said system during said second step are carried out iteratively.
 8. The method according to claim 1, wherein the system comprises an actuator and/or a robot.
 9. The method according to claim 1, wherein said system is controlled by said controller, the control policy of which has been devised by the method.
 10. The method according to claim 1, wherein a training system for devising an optimum control policy of a controller is configured to carry out the method.
 11. The method according to claim 1, wherein a control system for controlling a system is configured to carry out the method.
 12. The method according to claim 1, wherein a computer program contains instructions which cause a processor to carry out the method if the computer program is executed by said processor.
 13. The method according to claim 12, wherein a machine-readable storage medium is configured to store the computer program.